A stack of unpublished work. Image credit: Sear Greyson
Reproducibility, Replicability and Going into Production
Reproducibility
Reproducibility: given the original raw data and code, can you get all of the results again?
Reproducible != Correct
“Code available on request” is the new “Data available on request”
Reproducible data analysis requires effort, time and skill.
Replicability
Replicable: if the experiment were repeated by an independent investigator, you would get slightly different data, but would the substantive conclusions be the same?
In the specific sense, this is the core worry for a statistician!
Also used more generally: are results stable to perturbations in population / study design / modelling / analysis?
Only real test is to try it. Control risk with shadow and parallel deployment.
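One cheap way to probe stability before deployment is to perturb the data yourself. The sketch below is only an illustration: the data are simulated, the linear model is an assumed stand-in for a real analysis, and the bootstrap is one of several possible perturbations. It refits the model on resampled rows and checks whether the substantive conclusion (the sign of the slope) survives.

```r
set.seed(1234)  # fixed so that the sketch itself is reproducible

# Simulated data (an assumption for illustration): y depends linearly on x.
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Refit the model on bootstrap resamples of the rows and keep the slope.
boot_slope <- replicate(500, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x, data = dat[idx, ]))[["x"]]
})

# Percentile interval for the slope across the perturbed data sets.
slope_interval <- quantile(boot_slope, probs = c(0.025, 0.975))
slope_interval
```

An interval that stays well away from zero suggests the conclusion "y increases with x" does not hinge on the particular sample. It says nothing about changes in population or study design, which still need a real replication.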
Reproduction, Replication and Statistical Data Science
Monte Carlo Methods
Monte Carlo methods are used extensively in data science.
Usually solve a difficult problem (integration) or help to ensure results replicate (partitioning, sampling variability, posterior samples).
Almost always make reproduction of results more difficult. (Mitigate with seeding and the law of large numbers, LLN.)
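A minimal sketch of both remedies, assuming a toy integration problem (estimating E[X^2] = 1 for a standard normal X). The helper name mc_estimate is hypothetical. Seeding makes a single run reproducible, while a larger sample size makes the answer less sensitive to the seed.

```r
# Monte Carlo estimate of E[X^2] for X ~ N(0, 1); the true value is 1.
mc_estimate <- function(n, seed) {
  set.seed(seed)  # fix the random number stream so the run can be repeated
  mean(rnorm(n)^2)
}

# Seeding: the same seed and sample size reproduce the estimate exactly.
same_seed <- identical(mc_estimate(1e4, seed = 1234),
                       mc_estimate(1e4, seed = 1234))

# LLN: across 50 different seeds, estimates vary less when n is larger.
small_n <- sapply(1:50, function(s) mc_estimate(1e2, seed = s))
large_n <- sapply(1:50, function(s) mc_estimate(1e5, seed = s))

same_seed
c(sd_small_n = sd(small_n), sd_large_n = sd(large_n))
```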
Optimisation
Is the optimum you find stable over:
realisations?
starting points?
step size / learning rate?
realisations of the data?
A poorly drawn contour plot. Local modes make this optimisation unstable to the choice of starting point.
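The effect of the starting point can be sketched in one dimension. Everything here is an assumption made for illustration: the objective f is a made-up function with two local minima (near x = -1 and x = 2), and the optimiser is plain fixed-step gradient descent rather than any particular library routine.

```r
# Hypothetical objective with two local minima, near x = -1 and x = 2.
f      <- function(x) (x + 1)^2 * (x - 2)^2 + 0.5 * x
grad_f <- function(x) 2 * (x + 1) * (x - 2) * (2 * x - 1) + 0.5

# Plain gradient descent with a fixed step size (learning rate).
gradient_descent <- function(start, step_size = 0.01, n_iter = 2000) {
  x <- start
  for (i in seq_len(n_iter)) {
    x <- x - step_size * grad_f(x)
  }
  x
}

starts <- c(-1.5, 2.5)
optima <- sapply(starts, gradient_descent)

# One row per starting point: where the search ended and the objective there.
data.frame(start = starts, optimum = optima, value = f(optima))
```

The two runs settle in different modes, and only the one started on the left reaches the lower objective value. Running from several starting points and comparing the results is a simple check on this kind of instability.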
Pseudo-random Number Generators
Computers are deterministic, so true randomness is hard.
Pseudo-random number generation.
Set starting point with set.seed().
Beware: parallel programming and language interfacing.
# different values
rnorm(n = 4)
[1] -1.1546783 -0.6895971 -1.3753532 -0.3621124
rnorm(n = 4)
[1] -0.3592553 -0.9052359 0.2669273 -0.8162782
# the same values
set.seed(1234)
rnorm(n = 4)
[1] -1.2070657 0.2774292 1.0844412 -2.3456977
set.seed(1234)
rnorm(n = 4)
[1] -1.2070657 0.2774292 1.0844412 -2.3456977
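Calling set.seed() in the main session does not seed worker processes, which is one reason parallel code needs care. A minimal sketch using the base parallel package follows. The helper names draw and parallel_draws are hypothetical, and the approach assumes the number of workers stays the same between runs.

```r
library(parallel)

# Task run on each worker: draw n standard normal values.
draw <- function(i, n) rnorm(n)

# Hypothetical helper: start a cluster, seed it, and collect the draws.
parallel_draws <- function(iseed, n_workers = 2, n_draws = 3) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))
  # Give each worker its own L'Ecuyer-CMRG stream derived from iseed.
  clusterSetRNGStream(cl, iseed = iseed)
  parLapply(cl, seq_len(n_workers), draw, n = n_draws)
}

run1 <- parallel_draws(iseed = 1234)
run2 <- parallel_draws(iseed = 1234)

identical(run1, run2)            # do repeated runs agree?
identical(run1[[1]], run1[[2]])  # do the workers share a stream?
```

With the same iseed and the same number of workers the two runs should match, while the workers themselves draw from separate streams. Changing the number of workers changes how tasks map to streams, so the results are no longer expected to match.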
Wrapping up
Reproducible: can recreate the same results from the same code and data
Replicable: core results remain valid when using different data
Stochasticity causes problems: make use of LLN and set.seed()
Be very careful when you need to be both efficient and replicable.